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1. OVERVIEW 


This Quarterly Technical Report, Number 8, describes aspects 
of our work on the ARPA Computer Network under Contract No. F08606- 
73-C-0027 during the fourth quarter cf 1974. (Work performed from 
1969 through 1972 under Contract No. DAHC~69~C~0179 has been re- 
ported in an eavlier series of Quarterly Technical Reports, num- 
bered 1-16.) 


During this jJuarver three new network nodes were installed. 
Interestingly, all three nodes were installed at sites which al- 
ready possessed network nodes; the motivation for the new node 
equipment was the decire to attach more Host computers than can 
be supported by a single mode. I< is therelore noteworthy that 
the new Pluribus IMP, whien Is essentially ready for field instal= 
lation, has the physicai ability to handle many more Host connec- 
tions than the current line of Honeywell-based IMPs. The new 
nodes this quarter include IMPs at the Seismic Data Analysis 
Center and Stanford Research Institute, and a TIF av the USC 
Information Sciences Institute. At the end cf the quarter the IMP 
located at Case Western Reserve University was removed from the 


network. 


MoSv- Of the amplemenvatton of Host. access controls and alco= 
rithms Tor equal distribution of IMP resources amcene Hosts was 
Completed during the sfourtin juerver.. Due Co aititicult oro lems 
in the final target system, nowever, the complete implementation 
Of these Mmechanisrise Was Dest pONned Ul ut dsche [arst dueartver oF 2075. 
In addition, the IMP software was modified to permit, under econ~ 


fLroiled experimentai conditions, Host use of packets for which 


message processing is not performed. 
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Early in the fourth quarter we began field testing of the 
TIP software which implemented the access control ind user account- 
ing mechanisms described in our Quarterly Technical Report No. 6. 
This testing revealed several unanticipated difficulties, and ‘hus 
installation of this software did not begin until early December, 
about a month later than anticipated. By the end of the quarter 
the TIP software was installed in many operational TIPs, but some 
installations will be made in early 1975. The RSEXEC programs 
for maintaining the various access control and accounting data 


bases were operational by the end of the quarter. 


As noted in our Quarterly Technical Report No. 7, by the end 
of the third quarter the two production copies of the 14-processor 
Pluribus IMP hardware (currently 13 processors plus a paper tape 
reader to simplify testing) were essentially ready for delivery, 
and the software nearly so. During the fourth quarter software 
debugging and tuning continued, and toward the end of tne quarter 
we felt the system was ready for a field trial. Accordingly, one 
of the two production machines was "shipped" from our development 
area to tne: HBA Rese¢aren Computer Center. This will provide an 
environment similar to fieid insvatiation for testing during 
Carly 19/5.5  eCUron 2 Of Chis Peporusdeseribes ander a1 dt sour 


approach to system reliability issues in the Pluribus IMP. 


During the Dourth quarter wé published and distributed re= 
VrSVous UG Garee -Cperatzonal doOcumenus, dine wejor « .1G. ote, 
Specifications for the Interconnection of a Host and an IMr, BBA 
Re, ort #o. 2103, TIP User's Guide, and BBY Report Ho. 2184, TirP 
hardware Manual. In additicn, “ovr professional papers were pre- 
pared and submitted as follows: "The Evolution of Message Yrocess- 


ing Techniques in the ARPA Network," by J. M. McQuillan, to appear 
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in International Computer State of the Art Report No. 24: Network 
Systems and Software, Infotech, Maidenhead, England; "Pluribus -- 
A Reliable Multiprocessor," py S. M. Ornstein, W. R. Crowther, 

M. F. Kraley, R. D. Bressler, A. Michel, and F. E. Heart, sub- 
mitted to the AFIPS 1975 National Computer Conference; "Issues in 
Packet-Switching Network Design," by W. R. Crowther, F. E. Heart, 
A. A. McKenzie, J. M. McQuillan, and D. C. Walden, submitted to 
the AFIPS 1975 National Computer Conference; and "Some Considera- 
tions for a High Performance Message-Based Interprocess Communica- 
tion System," by J. M. McQuillan and D. C. Walden, to be presented 
at the ACM SIGCOMM-SIGOPS Interface Workshop on Interprocess Com- 
munications, March 24-25, 1975. 
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2. RELIABILITY ISSUES IN THE DESIGN OF THE PLURIBUS IMP 


As computer technology has evolved, syctem architects have 
continually sought new ways to exploit the decreasing costs of 
system components. One approach has been to pull together col- 
lections of units into multiprocessor systems. Usually the objec- 
tives have been to gain increased operatinggs power through paral- 
lelism and/or to gain increased systcin reliability through redun- 
dancy. 


In this section, we describe our approach to the objective 
cf nigh system reliatility in the multiprocessor IMF system which 
has been under development since 1972. Aspects of this new IMP 
nave been described in several of our previous Quarterly Technical 
Reports (in particular, Jo. 4). To review, the desifn objectives 


for the machine ere restated here. 


LNG Haciine Was. Cor be case le of jit aeyandiidcn, ine order to 
handle the 1.5-megabaud data circuits which were then planned for 
the network. It was to have a high fanout to Host computers con- 
NeCLCG au a Ned 2 itowas LO scom: ii alt sites (ol processing 
power, memory, £70). so that. one could Confiveure an aidividual IMP 
LO Meet Une requirements Of 10s vartieniar Location in the network, 
and change “nat configuration easily should the requirements change. 
Most of all, it was to be reliable. 


the family of machines we have developed to meet these goals 
has Deen nNameaG tne Piluyibus line. The machines are Lithly modular 
at several levels and have a minicomputer/muitiprocessor architec- 
ture as described in the past. We believe that reliability will 
become an increasingiy common concern as multiprocessors become 
more commonplace, and we believe that we have gained some inter- 
esting insights into the solution of this protlem which we describe 
here. 


| Report Nu. 2988 Bolt Beranek and Newman Inc. 


Tas sak SEPA ey a Rel Te 81 AMP 


| We will begin with a brief review of the structure of our 
systen. we then discuss our reliability goals and introduce 

| three broad strategies used to acnieve these goals. Next the sys- 
a tem reliabi tity structure is described, and finally we present a 

| number of «amples which illustrate how the system works. 


2.1 The Multiprocessor Architecture 


A novel featur: of our design is the consistent treatment of 
| all processors as equas units, both in the hardwire and in the 
: software. There 1s no specialization of processors for particular 
system functions, ard no assignment of priority among the proces- 
| sors, such as designating one as master. We chose to distribute 
among the processors nut only the application job ‘the IMP job) 
| but also the multi-processor control and reliability jobs, treat- 
ing all jobs uniformiy. We view the processors as a resource used 
to advance the algorithm; the idenvity of the precessor performing 
a particular task is of no importance. Programs are written as 
| for a single processor except th t the algorithm includes inter- 
locks necessary to insure mult‘processor sequentiality when re- 
quired. The software of our machine consists of a single conven- 
i tional program run by ali processors. Each’ processor nas Its own 
| local copy of about one quarter of this program and the remaining 
three quarters is in commonly accessible memory. 


; 2.1.1 Hardware Structure 


Reliability was a main concern in planning the hardware ar- 
chitecture. Aithough we tried to bui’d the irndivicual pieces 
solidly, our main goal was. to provide hardware which could be ex- 
ploited by the program to survive the failure of any individual 
component. 
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The hardware consists of busses joined together hy special 
bus couplers (see Figure 1) which allow units on one bus to access 
these on another. Each bus, together with its own power supply 
and cooling, is mounted in its own modular unit, permitting flexi- 
ble variation in the size and structure of systems. There are 
processor busses each of which contains two processors, each pro- 
cessor in turn with its own local 4K memory which stores frequently 
run and recovery-related code. There are memory busses to house 
the segments of a large memory common to all the processors. Fi- 
nally, there are I/O busses which house device ccentrollers as well 
as certain central resources such as system clocks and special 
(priority-ordered) task disbursers which replace the traditional 
Priorivy anverrupt system. ln a darve econii guration, about half 
of the machine consists of standard parts from the Lockheed SUE 


line; the remainder is of speciul design. 


We were fortunate to have a very specific Job in mind as we 
designed the system. This enabled us t place specific bounds on 
the problems we sought to solve. For example, the pron.sed ini- 

: tial setting within a communications network means that outside 
entities (neighboring communications processors, Hosts, users, 
etc.) may help to notice that things are going wrong. It also 
means that recovery assistance is potentially available from the 
Network Control Center (NCC) throuzth the network. The system is 
designed generally to avoid reliance upon external help, but upon 
occasion such help is useful and therefore we have provided meth- 
ods for permitting the system to be forcibly reloaded and restart- 
ed via the network. 
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Figure 1: Pluribus IMP (14-processor system) 
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2.1.2 Software Structure 


The problem of building a packet-switching store-ana~forward 
communications processor lends itself especially well to parallel 
solu:iox since packets of data can be treated independently of 
one another. Other functions of the IMP program such as general 
housekeeping, routing computations, reliability tasks, etc., can 
also be easily performed in parallel. Thus, we have been abie to 
devote ur attention to the problems of multiprocessor implementa- 
tion rather than struggling to force our algorithm into a parallel 
form. 


The structure that we nave chosen works as follows: First, 
the p ogram is divided into small pieces, called strtps, each of 
which handles a particular aspect vu. the job. tor example, one 
strip handles special routing messages from neighboring IMPs, 
another handles input from a loca? Host, and others handle further 
I/O and housekeeping functions. When a particular task needs to 
be performed, for instance upon receipt of a message over a com- 
munications circuit, the name (number) of the appropriate strip 
is put on a queue of tasks to be run. Each processor, when it is 
not running a strip, repeatedly checks this queue. When a strip 
number appears on the queue, the next available processor will 
take it off tne queue and execute the corresponding strip. We try 
to break the program into strips in such a way tat a minimum cf 
context saving, is necessary. 


Strips have different levels of importance. Data coming in 
over a high-speed communication circuit has higher priority than 
data coming in over a Teletype-speed line. The number assigned 
to each strip reflects the priority of the task it performs. When 
a processor checks the task assignment queue, it takes the highest 
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Priority job then available. since ail precessors access this 
Gueue frequently. the contention for th ts very pigh. Por vetii-— 
clenzy we therefore built a hardwure device called the Pseudo 
Interrupt Device (PID) which serves as a task queue. A single 
PISECUCE LOL allows the. ni snes priority asm tOsDbe terched and 
removed from the queue. Another instruction allows a new task to 
be put onto the ‘queue. All convention 1s arbitrated by standard 


bus logic hardware. 


The length of strips is governed by how long priority tasks 
CapieW aisbe Wi sole Ne DOCS SONG ware DUNGY cnn lhe WwOrstesCas eC aise s 
when all processors have just begun the longest strip. In the 
IMP application, the most urgent tasks can afford to wait a maxi- 
mum of 400 microseconds. Therefore, strips must not be longer 
than that. 


hmanherenc part of multhiprecessor operatioga 1s. vhe locking 
of critical resources. This is the mechanism by which the algo- 
rithm enforces sequentiality when it is needed. OQur system uses 
a, bead=and=clear Operation aS 2ts iprAmruive Jocking faciiity. 
1O avoid: d@adlocks, We assipn 2a prioricy ordering Co ovr resources 
and arrange that the software nct lock one resource when it has 


already locked another of lower or egual priority. 
C1232 "Status 


Two production Pluribus IMPs have been constructed and are 
running. Each contains 7 processor busses (generally, each holds 
2 processors), two memory busses, and two T/O busses. These ma- 
chines have been connected intermittently into the ARPA Network 
for testing purposes and one has now teen "Shipped" to the BBH 


Research: Computer Center for testing under field conditions. A 


“ingle Pluribus IMP has been running on the network for an extended 


terest tentat tant cod co 
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period in order to validate performance during routine operation. 
Three Satellite IMP configurations are presently under ccenstruc- 
tion. 


2.2 Reliability Goals 


Since the term "reliable system" can have many different 
meanings, it is important to establish clearly just what we are 
and what we are not trying to achieve. We are not crying to build 
a non-failing device; instead, we are trying to build a system 
which will recuperate automativally within seconds, or at most 

| minutes, following a failure. Furthermore, we want the system to 
| Survive not only transient failures but also solid failures of 
any single component. In many cases (such as the IMP job) it is 
not absolutely imperative to operate continuously and perfectly 
(as it may be in a space vehicle guidance computer); it suffices 
to operate correctly most of the t me so long as outages are in- 
frequent, kept brief, and fixed wit.out human intervention. 


HL Abe Hi eta eezepepsztsersszoneg eons sats 


How one copes with infrequent brief outages depends on what 
one is trying to do. For tasks not tightly coupled to real-time 
requirements (e.g., for man, large numerical computations), a 
Simple device is to choose checkpoints at which to record the 
state of the system so that one can always recover by restarting 
from the checkpoint just preceding an outage. The IMP system 
happens to be embedded in a lurger system which is quite forgiving 
(not an uncommon situation). Thus infrequent outages of a few 


seconds are tolerate? eusily, and outages of many seconds, while 


cauSing the particular nude to become temporarily unusable, will 


not in general jeopardize operation of the network as a whole. 
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Occasionally, despite all efforts, the system will break so 
catastrophically that it will be unable to recover. Cur goal is 
to reduce the probability of such total system failure to the 
probabiiity of a multiple hardware failure. We do not try to pro- 

| tect against all possibie errors; some of our procedures will fail 
to detect errors whose probability of occurrence is sufficiently 

| low. For example, all code is periodically checksummed using a 
16-bit checksum. A failure that does not disturb the validity 
of the checksum may not be detected. We do not mind if a failure 
renders large sections of the machine ur sable or inaccessible, 
providing, enough remains to run the system. The presence of run- 
nable tardware, however, is not sufficient to guarantee that opera- 
tion will be resumed; in addition, the softwa;*e must be able to 
survive the transients accompanying the failure and adapt to “he 
remaining hardware. This may include combating and overcoming 
active failures (for example, when an element such as a processor 


goes berserk and repeatedly writes meanirgless data into memory). 


All code is presumed to be debugged — i.e., all frequently 
occurring problems will have been fixed befcre the code is put 
into the field. On the other hand, we do assume that all manner 
of perverse and infrequent bugs wi:l forever be latent within the 
code. Thus, we must be able to survive even when all code, data 


structures, etc: , have been randomiyvy destroyed. 


In order to avoid complete system failure, a failed component 
must be repaired or replaced before its backup also breaks. ‘The 
system must therefore report all failures. The actual repair and/ 
or replacement will of course be performed by humans, but this 
Will generally take place long after the system has noted the 
failure and reconfigured itself to bypass the failed module. The 


ratio of mean-time-to-repair to mean-time-between-failures will 
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determine overall system reliability. It must also be possible 
£O remove and replace any component while the system ccentinues 
to run. Finally, the system should absorb repaired or newly 
introduced parts gracefully. 


2.3 Strategies 


In order to understamnu our system it is convenient to think 
of the strategies used to achieve our goals as divided into two 
parts; these more or less parallel the traditional division o: 
computer system into hardware and software. ‘The first part 
provides hardware that will survive any single failure, even a 
solid one, in such a way as to leave a potentially runnable ma- 
cuine intact ("potentially" implies that it may need resetting, 
reloading, etc.). The second part provides all of the facilities 
necessary to survive any and all transients stemming frem the 


failure, and to adapt to running in the new hardware configuration. 
2.3.1 Appropriate Hardware 


We have two basic strategies in nroviding tne hardware. The 
first is to include extra copies of every vital hardware resource. 
The second is to provide sufficient isolaticn between the copies 


so that any single component failure will impair only one copy. 


To increase effective bandwidth in multiprocessors, multiple 
copies cf heavily utilized resources are normally provided. For 
reliabiiity, however, al@ resources critical to running the algo- 
rithm are duplicated. Where possitlie the system utilizes these 


extra resources to increase the bandwidth of the system. 


it 28 not sufficient merely to provide duplicate copies of 24 
particular resource; we must also be sure that the copies are not 


dependent on any common resource. Thus, for example, in addition 
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vo providing multiple memories, we also inelude multiple busses 
on which the memories are distributed. Furthermcre, each bus is 
not only logically independent, but also hysicadlly modular. ‘The 
Chassis, with its own power supple and colvling, 1s bullt into an 
integral unit which may be poweren dewn, disconnected, and re- 
movec from the rack for servicing or replacement while the rest 


of the machine continues to run. 


All central system resources, such as the real time clock 
and the PID, are duplicated on at least two separat2 I/O busses. 
All connections between bus pairs are provided by separate bus 


COuUpIerS SO that. a @oupler Laliure can disable al srost the twe 


busses it is connecting; all other interconnection: between busses 
are unaffected. We have thus avoided the vulnerable centralized 
switch generally found at the heart of & multliproc2ssor. There 
are multiple racks, Gaci #ith Lvs Own power cOrd {to separate 
power sources if desired}, and busses are assipnei to racks in 


such a way that the less of even an entire rack does not remove 


With Tespecy eo non=—Contrel resources; Guch as iIneividual 
i/O interfaces, one has more freedom. When a particular line is 
G LiGerrace Undts 
9@ 3€lected for use 
Hol WOr~LD tne 


° it 2 f-  f7tte? 3 s ne tyr ~~ 4 es . os ¥ pedis Fa ‘ aa 
CyCra COSt., However, one i476 “eed be Chiy Sinetly connected. 
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in order for tne systen (lo adam to Ditvereny: hardware con— 
PLENUPalLions, Une Soltnare “ust be able toe deternine, wacvhout 
aid, what hardware resources ure availabls to it. We have 
Made 1G CONVElient LO Search Lom ane 26cave Trose resources whien 
are present and to de ) 


termine the type and parameters of those 


which are found. 
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Since we must allow for active failures, there must be a 
mechanism which allcus the program to turn otf or isolate a unit 
which is acting in a malevolent meiner. To arrange for this, all 
bus couplers have a program-controliable switch that inhibits 
transactions via that coupler. Thus, a bus may ve effectively 
"amputated" by turning off all couplers connected to that bus. 
These switches are protected from capricious use by requiring a 
password. Naturally an amputated processor has no access to these 
switches. 


A final rule that we have followed in maintaining udequate 
isolation is to prohibit any common signal that would connect to 
all busses. Normally a common reset line is considered essential 
in any hardware system; however, we have avoided 3 common reset 
line since @ Single failure on its driver could render the entire 
system inoperative. In our system there is no central point (not 
even a single power switch) where one can gain cuntrol of the 
entire system at once. Instead, we rely on resetting a section at 
a time using passwords. 


2.3.2 Software Survival 


With the above features, the Yluribus hardware can experience 
any single component failure and still present a runnable system. 
One must assume that as a consequence of a failure, the program 
may have been destroyed, the processors halted, and the hardware 
put in some hung state needing to be reset. We now investigate 
the means used to restore the algorithm to operation after a 
failure. The various techniques for doing this may be classified 
under three broad strategies: keep it simpie, worry about redun- 


dancy, and use watchdog timers throughout. 
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2.6.2.) Simplicity 


It is always good to keep a system simple, for then one has 
a better chance of making It work. We describe here three system 
constraints imposed in the name of simplicity. 


First, as mentioned above, we insist that all processors be 
identical and equal: they are viewed only as resources used to 
advance the algorithm. Each should be able to ao any system task; 
none should be singled out (except momentarily) for a particular 
function. The important thing is the algorithm. With this view 
it is clear that it is simplest if the algorithm is accessible to 


all processors of the system. A consequence of this is that the 


full power of the machine can be trougnt to bear on the part of 
the algorithm which is busiest ac a given time. 


One might argue that for some systems it is in fact simpler 
“4 (or more efficient) to specialize processors to specific tasks. 

-- One could, in such a case, then duplicate each different type for 
i reliability. With that approach, however, one must worry about 
the recovery of several different units, and all the possible 
interactions between them. We consider the recovery problem for 


| a group of identical machines formidable enough. 


* One consequence of treating all precessors equally is that a 
< program can be debugged on a single machine up to the point where 
i the multiple machine interaction matters. Once this has been 


done, we have found that processor interaction does not present 
a-severe additional debugging probiem. On. the otner hand, finding 
routine software bugs when a dozen machines are running is a dif- 
ficult problem. 
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A second characteristic of our system which arose from a 
desire to keep things simple is pessivity. We use the terms ac- 
tive and passive to describe communication between subsystems in 
which the receiver is expectec to put aside what it is doing and 
respond. The quicker the required response, the more active the 
interaction. in general, the mor2 passive the communication, the 
simpler the receiver can be, because it can wait untjl a conve- 
nient time to process the communication. On the other hand the 
slower response may complicate things for the sender. We believe 
that there is a net gain in using more passive systems. An ex- 
ample of this is our decision to make the task disbursing mecha- 
nism (the PIY¥) a passive device. Neither the hardware interfaces 
nor other processors tell a processor what to do; rather, pro- 
cessors aSk the PID what shculd be done next. There are some 
costs to such a passive system. The resultin; slower responsive- 
ness has necessitated additional buffering in sone of our inter- 
faces. In addition, the program must regularly break from tasks 


being executed to check the PID for more important tasks. 


The alterr:atives, however, are far worse. In amore active 
multiprocessor system, for example one which uses classical pri- 
OPity ianterrupts, i is difficult to decide which processor to 
switch to the new task. Furthermore, it is almost impossible to 
preserve the context of a processor while making such a switch 
because of the interaction with the resource interlocking system. 
The possibilities for deadlocks are frightening, and the general 
mechanism to resolve them cumbersome. With a passive systema 
processor finishes one task before requesting the next, thus 
guaranteeing that task switching occurs at a time when there is 


little context, e.g., no resources are locked. 
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Passive systems are more reliable for another reason: namely, 
the recovery mechanisms tend to be far simpler than those for ac- 


tive systems. 


As a third example of simplicity we introduce the notion of 
a reliability subsystem. A reliability subsystem is a part of the 
overall system which is verified as a unit. For example, there is 
a subsystem that, among other things, determines which common mem- 
ory pages are available for message buffers. The entire system is 
broken into these subsystems, which verify one another in an or-- 


derly fashion. 


The subsystems are cleanly bounded with well-defined inter- 
faces. They are self-contained in that each includes a self-test 
mechanism and reset capability. They are isolated in that all 
communication between subsystems takes place passively via data 
structures. Complete interlocking is provided at the boundary of 
every subsystem so that the subsystems can operate asynchronously 
with respect to one another. 


The monitoring of one subsystem by another is performed us- 
ing timer modules, as discussed below. These timer modules guar- 
antee that the self-test mechanism of each subsystem operates, 
and this in turn guarantees that the entire subsystem is operating 
properly. 


2.3.2.2 Redundancy 


Redundancy is simultaneously a biessing and a curse. It oc- 
curs in the hardware and the software, and in both control and 
data patns. We deliberately introduce redundancy to provide re- 
liability and to promote.efficiency, and it frequently occurs be- 
cause it is a natural way to build things. On the other hand, 
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the mere existence of redundancy implies a possible disagreement 
between the versions of the information. Suen inconsistencies 
usually lead to erroneous behavior, and often persist for long 


periods. 


It was not until we adopted a strategy of systematically 
searcning out and identifying all the redundancy in every sub- 
system that we succeeded in making the subsystems reliable. This 
process therefore corstitutes one of our three basic strategies 


for constructing robust softwiure. 


We use the term redundancy here in a somewhat subtle sense, 
not only for cases in which the same information is stored in two 
places, but also when two stored pieces of informition eacn imply 
a common fact although neither is necessarily sufficient to imply 
Cine 20 tid 21s 


Let us consider a few examples of redundancy to make these 


ideas more concrete: 


- A bpuffer Nolding @ message to be processéd, and a pointer 
to the buffer on a "to be processed" queue. (One can 
easily deduce from the state of our buffers what process-~- 


ing they need.) 


* A device on the bus requesting a bus cycle, anda flip- 


fLOpsa eh hen DUS arb ter capturing tune lack of sche request. 


* One processor seeing a memory word at a particular system 
address and another seeing the same word at the same 


address. 


* The PID level which a particular device use. to request 
service anda the device which the software services in re- 


sponse to that level. 
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There are several methods of dealing with redundancy. The 
first and best is to eliminate it, and always refer to a single 
copy o. tne information. When we choose not to eliminate it, we 
can check the redundancy and explicitly detect and correct any 
inconsistencies. It does not really matter how this is done as 
the system is recovering from a failure anyway. What is important 
is to resolve the inconsistency and keep the algorithm aoving. 
Sometimes it is too difficult to test for inconsistency; then 
timers can be used as discussed in the next section. 


We can now see how these methods are applied to our four 


examples: 


* If the buffer and queue are inconsistent, the buffer will 
not be processed. Each buffer has its own timer; if the 
buffer is not processed in a reasonable time, it will be 
replaced on the queue. 


¢- If the bus arbiter and devices disagree, the bus may hang. 
We have added a timer which times out bus transactions and 
does a bus reset after one second of complete inactivity. 


* The software verifies that all processors see the same 
memory at ‘:: given address and when they do not, decides 
whether to declare the memory or one of the processor's 
unusable. 


¢ An inconsistency between ‘the PID level that an interface 
is using and the software response to that level cannot 
persist because a process periodically forces the tables 
driving the response to agree with the hardware. (The 
PID level(s) used by each device are program-readable. ) 
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2.3.2.3 Timers 


We have adopted a uniform structure for implementing a moni- 
roring function between reliability subsystems based on watchdog 
timers. Consider a subsystem which is being monitored. We design 
such a subsystem to cycle with a characteristic time constant and 
insist that a complete self-consistency check be included within 
every cycle. Regular passage through this cycle therefore is 
sufficient indication of correct operation of the subsystem. If 
excessive time goes by without passage through the cycle, it im- 
plies that the subsystem has had a failure from which it has not 
been able to recover by itself. The mechanism for monitoring the 
cycle is a timer which is reset by every passage through the cycle. 
We have beth hardware and software timers ranging from five micro- 
seconds to two minutes in duration. Another subsystem can monitor 
this timer and take corrective action if it ever runs out. To 
avoid the necessity for subsystems to be aware of one another's 
internal structure, each subsystem includes a reset mechanism 
which may be externally activated. Thus corrective action con- 
sists merely of invoking this reset. The reset algorithm is as- 
sumed to work although a particular incarnation in code may fail 
because it gets damaged. In such a case another subsystem (the 
code checksummer) will shortly repair the damage. 


Note that we have introduced an active element into our 
otherwise totally passive system. These resets constitute the 
only active elements and furthermore are invoxed only after a 
failure has occurred. This approach seems to provide for the 
maximum isolation between subsystems. 
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2.4 System Reliability Structure 


In the previous section we described a mechanism whereby 
One subsystem can monitor another. Our system consists of a chain 
of subsystems in which each subsystem monitors the next member of 
the chain. Figure 2 and Table 1 show this structure in the system 
we have built for the IMP. An efficient way to build such a chain 
is to have lower subsystems provide and guarantee some important 
environmental feature used by higher level systems, For example, 
a low level in our chain guarantees the integrity of code for 
higner levels which thus assume the correctness of code. Such a 
System is vulnerable only at its bottom. (We are assuming here 
that we have runnable hardware although it may be in a bad state, 
requiring resetting.) The code tester level (see Figure 2) 
serves three functions: first, it checksums all low level code 
(including itself); second, it insures that control is operating 
properly, i.e., that all subsystems are receiving a share of the 
processor's attention; third, it guarantees that locks do not 
hang up. It thus guarantees the most basic features for all nigh- 
er levels. These will, in turn, provide further environmental 
features, such as a list of working memory areas, I/O devices, 
etc., to still higher levels. The method by which the code tester 
subsystem itself is monitored and reset will be discussed shortly. 


The mechanisms we have described ensure that the separate 
processor subsystems have a satisfactory local environment in 
which to work. Before they can work together to run the IMP 
algorithm it is necessary that a common environment be stablished 
for all processors. We call the process of reaching an agreement 
about this environment "forming a ~onsensus", and we dub the group 
of agreeing processors the Consensus. The work done by the Con- 
sensus is in fact performed by individual processors, but the 
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Figure 2: Pluribus Subsystem Monitoring Structure 
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Table 1: Major Subsystems and Their Functions 


IMP SYSTEM: Watches network behavior — will not cooperate 
with irresponsible ...twork behavior. 


IMP SYSTEM RELIABILITY: Watches IMP SYSTEM (data structures mostly) 


CONSENSUS: Watches IMP SYSTEM RELIABILITY, verifies all Co.nmon 
Memory Code (via checksum), watches each processor, 
finds all usable hardware resources (interfaces, 
PIDs, memory, proc 2ssors, etc.), tests each and 
creates a table of gocd ones. Makes spare copies 


of code. 


INDIVIDUAL: Watches CONSENSUS, finds all memory and processors 
it considers usable, determines where the Consensus 
is communicating and tries to join it. 


CODE TESTER: Watches INDIVIDUAL, verifies all Local Memory Code 
(via a checksum), guarantees control and lock 


1echanisms. 


BUS TIMER + 60Hz INTERRUPT: Watches CODE TESTER, guarantees bus 
activity. 
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coordination and discipline imposed on Consensus members make them 
behave like a single logical entity. An example of a task re- 
quiring consensus is the identification of usable common memory 
and the assignment of functions (code, variables, buffers, etc.) 
to particular pages. The members of the Consensus will not al- 
ways agree in their view of the environment, as for example wnen 

a broken bus coupler blinds one member to a segment of common 
memory. In this case the Consensus, including the processor with 
the broken coupler, will agree to run the main system without 

that procesSv.. 


The Consensus maintains a timer for every processor in the 
system, whether or not the processor is working. The Consensus 
will count down these timers in order to eliminate uncooperative 
or dead processors. In order to join the Consensus, a processor 
need merely register its desire to join by holding off its timer. 
Within the individual processors it is the code tester subsystem 
which holds off the timer. 


The Consensus, then, acting as a group, provides the monitor- 
ing mechanism for the individuals as shown by tne feedback monitor- 
ing path in Figure 2. This monitoring mechanism run by the Con- 
sensus includes the usual reset capability, which in this case 
means reloading the individual's local memory and restarting the 
processor. Since all of the processors have identical memories, 
reloading is not difficult. We provide (password protected) 
paths whereby any processor can reset, reload, and restart any 
other processor. ‘This reliance on the Consensus is indeed vul- 
nerable to a simultaneous transient failure of all processors. 
However, the Network Control Center has access to these same re- 
set and reload facilities and these enable it to perform the 
reload function remotely (a path also shown in the figure). 
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Thus the Consensus and/or Network Control Center are thc 
ultimate guarantors of the lowest level subsystem. While this 
process is sufficient it is sometimes slow. For many cases in 
which the Consensus is disabled (as for example when all of the 
processors halt), a simpler reset without reloading will suffice. 
For this reason we have provided a simpler and more immediate 
(if redundant) mechanism in each processor for resetting the con- 
trol and lock systems. We implement this mechanism in software 
with the assistance of a 60Hz interrupt and a one-second timer 
on the bus. Together these provide a somewhat vulneratle but 
much quicker alternative to the more ponderous NCC/Consensus 


resets. 


There is a problem about what area of common memory the pro- 
cessors should use in which to form the Consensus, sincc failures 
may make any predetermined system address inaccessible. To allow 
for this, sufficient communication is maintained in ali pages of 
common memory to reach agreement both as to which processors are 
in the Consensus and where further communication is to take place. 


To protect itsel? from broken processors, the Consensus ampu- 
tates all processors which do not succeed in Joining it. There 
is a conflict between this need to protect itself and the need to 
admit new or healed processors into the Consensus. The amputation 
barrier is therefore lowered for a pvrief period each time the Con- 
sensus tries to restart a processor. This restart is in fact the 
reset based on the timer held off by the code tester subsysten, 
as discussed above. In the case of certain active failures, 
even this brief relaxation may cause trouble. In these cases the 
Consensus will decide to keep the barrier up continuously. 
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Certain active failures may prevent the formation of a con- 
sensus. In such a situation each processor will benave as ir it 
were a Consensus (of one) and will try to amputate all other pro- 
cessors- At the pcint when the actively failing component is 
amputated, the remaining processors will be able to form a con- 
Sensus. From this point the system behaves as described above. 


Further up in the figure there is another joining of in- 
dependent units, namely IMPs joining to form the network. The 
analogy here is incomplete because the ARPA Network was not bulit 
with these concepts in mind. There is collective behavior, for 


example routing, and individual behavior which accepts collective 


decisions only after they pass reasonability tests. However, the 


reliability features of the network are concentrated in the Net- 
work Control Center, which depends on the continual presence of 
human operators and technical staff for successful operation. 


2.5 Some Examples of Failures 


In order to explain in more practical terms some of the re- 


liability mechanisms, we now discuss a number of specific failures 


and describe the methods which detect and repair the resulting 


damage. In each case, we focus on the component that failed and 


the particular m2chanism that takes care of that failure. Deriva- 


tive failures may well take place, and other mechanisms will handle 


these, since all techanisms operate all the time. The severity of 


the direct consequences of these examples is rated on the following 


Scai>, from least severe to most severe: 
l. Momentary slowdown - nc data loss 
2. Loss of data (a network message) 


3. Temporary loss of some IMP function (a network link) 
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4, Momeitary total IMP outage with local self-recovery 
5. Outage requiring reloading via the network 


6. Failure requiring human insight for debugging. 


Exampie 1. Suppose first that a bus coupler experiences a tran- 
sient failure on a single reference to common memory, which leaves 
one word of commcn memory with the wrong contents but cor)e2ct par- 
ity. Suppose further that the failure is subtle, in the sense 

that there 2s no obvious, ill effect on processor control, dike 
Ralting or lovping, which will be caught by lower level mechanisms. 
We will focus first on examples which cause minimal disruption and 
Where detection and gentle recovery are the primary concerns. We 


consider four examples of transient memory failures: 


Example 1.a Suppose that a word of text in one of the messages 
we are delivering becomes smashed. There is a checksum on all 
messages and the network will notice at one of its checkpoints 
that the message has gone bad. The source will be prompted to 


send a new copy. (Severity 2) 


Example 1.b Near the heart of our system is & queue of unused 
buffers called the free ast. Suppose ine: fal lune is. in cthe -struc= 
ture Of this queue. The sy::tem cxplicitiy tests for both a looped 
queue and wrong things on the queue. A Inore subtle form of error 
occurs when some buffers which should be on the queue are missing 
from it. Our system is designed so tnat a buffer should be re- 
mMOVeO from Ene aree list ors woel Vwoeminubes 16 on taney of 
timer is maintained on each buffer, which is reset whenever the 
DUPICr LeUUric (CO Che tree fist nO d sony seme rer ver oulecout., 
tts Durter is forced back cito the Tree dist. The result of Sis 


failure will be a degradation of system performance as it attempts 
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to run with fewer buffers fc. a short while, followed by complete 
recovery within two minutes. The IMP will stay up and no messages 
will be lost. (Severity 1) 


Example 1.c Suppose that one or the locks on a resource is broken 
so that it wrongly locks the resource. Any Subsystem which tries 
to use the resource will put a processor into a tight loop waiting 
for the resource to become free. In about 1/15 sec. this will 
cause the processor's timer, driven off its 60Hz clock interrupt, 
to run out. Upon investigation, the program will notice that the 
subsystem is waiting for a locked resource, and arbitrarily un- 
lock it. Aside from the 1/15 sec. pause, the system will be un- 
affected by the transient. (Severity 1) 


Example _1.d Suppose now that a failure strikes common memory 

ho ding code, and that the trouble is subtle — either the code 

is not run often or the change has no immediate drastic effect. 

In a few seconds the processors will begin to notice that the 
checksum on that piece of code is bad and stop running it. Shortly 
the whole Consensus will agree, and will switch over to use the 
memory holding the spare copy of that code. Unless the broken 

code has already caused some other trouble, the problem is thereby 
fixed, with only momentary slowdown. (Severity 1) 


Example 2. Suppose a processor fails by suddenly and permanently 
Stopping. The immediate effect will be that some task will be left 
half done, with a high probability that some resource is locked. 
This looks to the system like a data failure, as in examples l.a, 
l.b, and l.c abev>. The recovery will be identical. In a few 
seconds the Consensus will notice that the processor has dropped 
out and processor recovery logic will be invoked. Since the 
processor is solidly broken the recovery will be unsuccessful, 


28 


a ATR RET I tC ert tte ste INA ct eset vtec eft Doe ne enor C FH UA Nabors SHROENAAeLGer nbeteetiine® Snare tine? 


Ta a LU Be 


= 
Zz 
& 
=z 
z 
= 
¥ 
z 


Report No. 2988 Bolt Beranek and Newman Inc. 


and the system will settle into a mode where every so often re- 
covery is retried. Eventually a repairman will fix the processor, 
at which time recovery will proceed and the processor will rejoin 
Consensus. It is hard to predict whether the IMP system will 
momentarily go down because of the failure; experience indicates 
that it usually stays up, but our experience is limited to lightly 
loaded machines. (Severity 2-4) 


Example 3. Suppose a power supply for a processor bus breaks. 
This is similar to the failing processor described above except 
that both processors on the bus are affectzd and the processors 
are given a hardware warning sufficiently far in advance that they 
can halt cleanly. The system will surely stay up through this 
failure. (Severity 1) 


Example 4. Now consider a case in which some page of common 

memory ceases to answer when referenced. Each processor will get 
a hardware trap when it tries to use that memory, forcing it di- 
rectly to the code which routinely verifies the environment. As 
a result, the failing memory will be deleted from the memory list 
by the Consensus and another module will be pressed into service 


to take its place. 


If the failed page contained code, a spare copy will normally 
be availabl> and a new spare copy will be made if possible. If 
it conteined data, an unused page will be pressed into service. 
In either case, the system will be reinitialized, momentarily 
bringing the IMP system down. If the failed page contained the 
Consensus communication area, a mew Consensus must be formed and 


thus recovery will take a little longer. (Severity 4) 
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Example 5. Let us now consider a failure of the PID. Suppose 
that the PID reports a task not previously set. When invoked, 
each strip checks to make sure that it is reasonable for the strip 
to be run. If not, another task is sought. Suppose instead 

that the PID "drops" a task. A special process periodically sets 
all PID flags independent of what needs to be done. This causes 
no harm, because superfluous tasks will be ignored (as described 
above), and serves to pick up such dropped tasks. Thus we have 
both a consistency check on redundant information and a timer 
built into our use of the PID. If a PID fails solidly, another 
PID will be switched in to operate the system. Transient failures 
cause slowdown; switchover may moments::ly bring down the IMP 
system. (Severity 1, 4) 


All of this leads to a slightly different image of the PID. 
Instead of being the central task disburser, with all processors 
relying on it to tell them what to do, the PID is a guide, sug- 
gesting to processors that if they look in a certain piace, they 
will probably find some useful work to do. The system would in 
fact run without a PID, albeit much more slowly and inefficiently. 


Example 6. Suppose a halt instruction somehow gets planted in 
common memory and that all processors execute it and stop. There 
is thus no Ceonsensus left to come to the rescue. Furthermore, 
60Hz interrupts are ineffective in a halted processor. After one 
second of inactivity, the bus arbiter timer will reset the pro- 
cessors, making them once more eligible for 60Hz interrupts which 
will restart them. BPefore the broken code is run, it will be 
cnecksummed, the discrepancy found, and a spare ccpy used. 
(Severity 2-4) 
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Example 7. Let us consider now what happens when, in common mem- 
ory, an end test for a storing loop is destroyed, causing each 
processor to wipe out its 60Hz interrupt code in local memory. 

im this case not. only vare "there no processors, left to hetp. but 
the 60Hz interrupt will not help either, since the interrupt code 


itself ts broken. This is a case in which the machine is incapar’ 
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of rescuing itself and will go off the network as a working node. 
When the Network Control Center notices that the IMP is no longer 
up, it will commence an external reload, restoring the IMP to op- 


eration. (Severity 5) 


Example 8. Consider the case of a processor whose hardware is 
solidly broken such that it repeatedly stores a zero into a loca- 
tion in common memory. Mechanisms described above will repeatedly 
fix the changed location, but it is desirable to eliminate the 
continuing presence .f this disrupting, influence. The Consensus 
will notice that one of its number has dropped out and wiil en- 
deavor Uo help the errant. processor. After a tew tries, a longer 
timer will run out, and the Consensus will take c more drastic 
action: final amputation. In this case there will be a rather 
lengthy IMP outage out the system will recover without external 


help. (Severity 4) 


Exampie: 9. One fallure trom whien: there is: no recovery, “either 
automatic or remote, iS a program which impersonates normal be- 
havior but is still somehow incorrect. That is, it hoids off the 
right timers, has a valid checksum, and simulates enough normal 
behavior so that higher leveis (e.g., the NCC) are satisfied. For 
example, if it were not for the fact that the CC explicitly checks 
the version number of the program running in each IMP, a previous, 
compatible, but obsolete version of the program would exhihit 

this behavior. (Severity 6) 
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L 
Example 10. Another class of failures which is hard to isolate | 
and deal with is low-frequen-y intermittents. Consid-r the case ; 
of a single processor which is broken such that its indexed shift 


instruction performs incorrectly. Since this instruction only 

occurs in some infrequently executed procedures, the failure only 
manifests itself, on the average, once every period t. If t is 

large, for instance one year, then we can safely disregard the f 
error, since its frequency is in the range of other failures over L 
which we have no control. If t is small, say 100 milliseconds, 
then the Consensus will isolate the bad processor and excise it. 


At some intermediate frequency, however, the Consensus will fail 


poe em 


to correlate successive failures and will instead treat each as 
a separate transient. The system will repeatedly fail and re- 
cover until some human intervenes. (Severity 6) 


2.6 Results and Conclusions 


Some strategies and techniques for building a reliable multi- 
processor have been described above. We have, in fact, actually 
ruilc and programmed such a machine using these strategies. We 
have found this machine straightforward to debug, both in hardware 
and software. Furthermore, the system continues to operate when 
individual power supplies are turned off, when memory locations 
are altered, when cables and cards are torn out, and through a 
variety of other failures. We have yet to establish field per- 
formance (which must be measured both in rate of recoverable in- 
cidents and in rate of unrecoverable failures), but we are now 
beginning to collect this information. 


While we have discussed principles of Pluribus reliability 
in terms of a specific application (the IMP), most of the concepts 
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are apnlication independent. We have been able to separate the 
application code from the reliability subsystems and, in fact, we 
are using the reliability subsystems intact in another application 
of the Pluribus machine. 
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